Privacy-Preserving Analysis with Local LLMs: Generating Synthetic Data Proxies

Author

Martin Schweinberger

Introduction

Researchers working with sensitive data face a persistent dilemma. Modern AI assistants — such as Claude, ChatGPT, and Gemini — are extraordinarily helpful for writing analysis code, reformatting data, and generating R scripts. But using them requires uploading your data to external servers. For many research datasets, this is simply not permitted: institutional ethics approvals, data governance policies, and participant consent agreements routinely prohibit sending identifiable or sensitive information to third-party platforms.

This showcase presents a practical solution to that dilemma using local large language models via Ollama. The core idea is straightforward:

  1. Describe the structure of your real, sensitive data to a local LLM running entirely on your own machine
  2. Generate a synthetic dataset that mirrors the structure, format, and content of the real data — but contains no real participants or real information
  3. Upload the synthetic dataset to a cloud AI assistant (Claude, ChatGPT, etc.) and ask it to write the R analysis code you need
  4. Run that code locally on your real data

At no point does any real participant data leave your machine. The cloud AI sees only synthetic examples; your real data stays local throughout.

This showcase demonstrates the complete workflow for two types of sensitive data that are common in clinical and language research: conversation transcripts from patient interviews, and tabular data from clinical assessments. Both examples are drawn from a realistic clinical linguistics research scenario.

Prerequisite Tutorials

Before working through this showcase, you should be comfortable with the material covered in the prerequisite tutorials.

You will need Ollama installed and the llama3.2 model downloaded before running any code in this showcase. See the Ollama tutorial for setup instructions.

Learning Objectives

By the end of this showcase you will be able to:

  1. Explain the privacy argument for using local LLMs as a synthetic data generation step before engaging cloud AI assistants
  2. Craft prompts that instruct a local LLM to reproduce the structure, format, and statistical properties of a sensitive dataset without reproducing any real content
  3. Generate synthetic conversation transcripts that mirror the linguistic features of real interview data
  4. Generate synthetic tabular data that mirrors the variable names, data types, and value distributions of a real clinical dataset
  5. Use a synthetic dataset as a proxy when requesting analysis code from a cloud AI assistant
  6. Verify that R code generated from synthetic data runs correctly on real data

Citation

Schweinberger, Martin. 2026. Privacy-Preserving Analysis with Local LLMs: Generating Synthetic Data Proxies. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/localllm_showcase/localllm_showcase.html (Version 2026.05.01).


The Privacy Problem

Section Overview

What you will learn: Why sensitive research data cannot be uploaded to cloud AI services; the specific ethics and governance constraints that apply; and why a synthetic proxy approach addresses these constraints without sacrificing analytical capability

Why cloud AI services are off-limits for sensitive data

Consider a researcher studying language use in patients with mild cognitive impairment. She has collected 80 recorded and transcribed interviews with patients. Each transcript contains:

  • The patient’s name (or pseudonym linked to a key file)
  • Detailed descriptions of their cognitive difficulties, daily life, and emotional state
  • Potentially identifying information about family members and healthcare providers

She wants to use Claude or ChatGPT to help her write R code that extracts linguistic features from the transcripts — but her ethics approval explicitly states that participant data may only be stored and processed on university-approved infrastructure. Sending the transcripts to Anthropic’s or OpenAI’s servers would breach this condition.

This is not an unusual situation. Across disciplines, data governance constraints commonly prohibit uploading:

  • Clinical data — patient records, therapy session transcripts, cognitive assessment scores
  • Survey data with open-ended responses — especially when topics are sensitive (mental health, sexuality, immigration status, criminal history)
  • Educational data — student assessment records, learning disability diagnoses
  • Legal data — witness statements, court transcripts, case records
  • Sociolinguistic fieldwork data — recordings and transcripts from vulnerable communities, minority language speakers, or speakers who have given consent for specific uses only

What the researcher actually needs

In most cases, the researcher does not need the AI assistant to analyse the data — she needs it to write code that she will then run locally. The AI’s job is to understand the data structure and produce syntactically correct, well-commented R code. For this task, the AI does not need real data. It needs data that looks like real data.

The synthetic proxy solution

A synthetic data proxy is a dataset that:

  • Has exactly the same structure as the real dataset (same columns, same variable types, same format)
  • Has the same statistical and linguistic properties as the real dataset (similar distributions, similar vocabulary, similar text length)
  • Contains no real participants, no real measurements, and no identifying information

With a good synthetic proxy, a cloud AI assistant can write analysis code that runs on the real data without ever having seen any of it.
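
The structural requirement can be checked mechanically. The sketch below uses hypothetical `real_df` and `synthetic_df` stand-ins (the real data never leaves your machine; only its structure is compared):

```r
# Hypothetical stand-ins: real_df is the sensitive data (kept local),
# synthetic_df is the proxy you intend to share
real_df      <- data.frame(patient_id = c("P01", "P02"), mmse = c(24L, 27L))
synthetic_df <- data.frame(patient_id = c("X01", "X02"), mmse = c(22L, 29L))

# Structural equivalence: identical column names and identical column types
identical(names(real_df), names(synthetic_df)) &&
  identical(sapply(real_df, class), sapply(synthetic_df, class))
```

If this returns TRUE, any code written against the proxy will at least see the same columns and types as the real data.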

Why a local LLM for generation?

You might ask: why not generate synthetic data manually, or use a statistical package like synthpop? For tabular data, statistical synthesis packages are often a good choice — we discuss this below. But for text data (transcripts, open-ended responses, narrative descriptions), generating realistic synthetic text manually is time-consuming, and most statistical synthesis methods do not apply. A local LLM can produce realistic synthetic transcripts in seconds, guided by a detailed prompt that describes the real data’s structure and content without exposing any actual participant information.

The local LLM also handles the generation step with complete privacy: the description of your sensitive data never leaves your machine.


The Workflow

The complete workflow consists of five steps, which we will work through in detail for each data type:

Step 1: Describe the real data structure to the local LLM
          (no actual participant data in the prompt)
              │
              ▼
Step 2: Local LLM generates a synthetic proxy dataset
          (runs entirely on your machine — no data leaves)
              │
              ▼
Step 3: Upload the synthetic proxy to a cloud AI assistant
          (Claude, ChatGPT, etc.)
          + describe what analysis you want
          + ask for R code
              │
              ▼
Step 4: Cloud AI returns R code based on the synthetic data
              │
              ▼
Step 5: Run the R code locally on your real data
          (the real data never left your machine)
Always verify generated code on synthetic data first

Before running AI-generated code on your real sensitive dataset, test it on the synthetic data to confirm it runs without errors and produces sensible output. Only then run it on the real data.


Setup

Code
# Install required packages
install.packages(c(
  "ollamar",    # R interface to Ollama
  "dplyr",      # data manipulation
  "tibble",     # tidy data frames
  "stringr",    # string processing
  "readr",      # reading and writing CSV files
  "purrr",      # functional iteration
  "flextable",  # formatted tables
  "jsonlite"    # JSON parsing
))
Code
library(ollamar)
library(dplyr)
library(tibble)
library(stringr)
library(readr)
library(purrr)
library(flextable)
library(jsonlite)
Code
# Verify Ollama is running before proceeding
ollamar::test_connection()
<httr2_response>
GET http://localhost:11434/
Status: 200 OK
Content-Type: text/plain
Body: In memory (17 bytes)
Code
ollamar::list_models()
                     name   size parameter_size quantization_level
1         llama3.2:latest   2 GB           3.2B             Q4_K_M
2 nomic-embed-text:latest 274 MB           137M                F16
             modified
1 2026-03-20T08:40:36
2 2026-03-20T08:40:37
Ollama Must Be Running

All ollamar calls in this showcase require Ollama to be installed and running as a background service. If you have not yet installed Ollama, see the setup section of the main Ollama tutorial.

Pull the model used in this showcase if you have not already done so:

ollama pull llama3.2
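
From R, you can confirm the pull succeeded before generating anything. This small helper is an illustration, assuming (as the `list_models()` output above shows) that installed models carry a `:latest`-style tag:

```r
# Check that a model tag is present among the installed model names
model_available <- function(model, installed) {
  any(startsWith(installed, model))
}

# With Ollama running, pass the installed names from ollamar:
# model_available("llama3.2", ollamar::list_models()$name)
model_available("llama3.2", c("llama3.2:latest", "nomic-embed-text:latest"))  # TRUE
```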

Part 1: Synthetic Conversation Transcripts

Section Overview

What you will learn: How to describe the structure and content of a sensitive interview transcript to a local LLM; how to prompt the model to generate a realistic synthetic version; how to verify the synthetic transcript is structurally equivalent to the real data; and how to use the synthetic transcript to get analysis code from a cloud AI assistant

The research scenario

A clinical linguist has collected transcribed interviews with 40 patients diagnosed with early-stage Alzheimer’s disease. Each interview follows a semi-structured format in which the clinician asks a series of standard questions and the patient responds. The transcripts use a simplified CHAT-like notation: speaker turns are marked with *CLI: (clinician) and *PAT: (patient), pauses are marked with (.) for short pauses and (..) for longer ones, and incomplete words are marked with a trailing hyphen.
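
This notation is straightforward to work with programmatically. As a minimal illustration, here is how the markers on a single (invented) patient turn can be detected with stringr:

```r
library(stringr)

# An invented single turn in the notation described above
turn <- "*PAT: I was go- going to the (.) um (..) the shops I think."

str_detect(turn, "^\\*PAT:")          # TRUE: this is a patient turn
str_count(turn, "\\(\\.\\)")          # 1 short pause
str_count(turn, "\\(\\.\\.\\)")       # 1 longer pause
str_count(turn, "\\b\\w+-(?=\\s)")    # 1 incomplete word ('go-')
```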

The researcher wants to use a cloud AI assistant to write an R script that:

  1. Reads the transcript files
  2. Extracts all patient turns
  3. Calculates mean utterance length per patient
  4. Identifies and counts filled pauses (uh, um, er)
  5. Outputs a summary table

She cannot upload real transcripts to the cloud AI — but she can generate a synthetic transcript that has the same format and similar linguistic properties, and use that as a proxy.

Step 1: Describe the real data structure

The first prompt describes the data structure without including any real participant data. We tell the model what the format looks like, what linguistic features are present, and what range of language we expect — but we do not paste in any actual transcript.

Code
# Step 1: Describe the data structure to the local LLM
# No real participant data is included in this prompt

structure_description <- "
I am working with transcripts of clinical interviews with patients who have
early-stage Alzheimer's disease. I need you to generate a realistic synthetic
example transcript that I can use as a proxy when asking a cloud AI assistant
to write R analysis code. The synthetic transcript must NOT contain any real
participant information.

The transcripts use this format:
- Speaker turns are marked with *CLI: (clinician) or *PAT: (patient)
- Short pauses within a turn are marked with (.)
- Longer pauses are marked with (..)
- Incomplete or abandoned words end with a hyphen, e.g. 'I was go- going'
- Filled pauses (uh, um, er) appear in the text as spoken
- Each turn is on its own line
- The transcript begins with a @Begin marker and ends with @End
- Metadata lines at the top use @ notation: @Participants, @Date, @Location

The interviews typically:
- Last about 15 minutes (approximately 80-120 speaker turns total)
- Follow a semi-structured format where the clinician asks standard questions
  about daily activities, memory, and family
- Show typical features of MCI/AD language: word-finding pauses, repetitions,
  incomplete sentences, topic drift, and difficulty with complex constructions
- The patient turns average about 12-20 words, with considerable variation

Please generate ONE complete synthetic transcript of approximately 40 turns
(roughly 20 exchanges) that realistically mirrors this format and these
linguistic features. Use clearly fictional names (e.g. 'Dr Smith' and
'Patient: Margaret') and fictional details throughout. Do not base the
content on any real person.
"

Step 2: Generate the synthetic transcript

Code
# Step 2: Generate the synthetic transcript using the local LLM
# Everything runs on your machine — no data leaves

cat("Generating synthetic transcript...\n")
Generating synthetic transcript...
Code
synthetic_transcript <- ollamar::generate(
  model  = "llama3.2",
  prompt = structure_description,
  output = "text"
)

cat(synthetic_transcript)
Here is a synthetic transcript of approximately 40 turns, following the specified format:

@Participants: Dr. Johnson, Patient: Emily Wilson
@Date: March 12, 2023
@Location: Oakwood Medical Center

@Begin

*CLI: Dr. Johnson
Hello Emily, thank you for coming in today. Can you tell me a little bit about your daily routine?

*PAT: Emily Wilson
Uh, I try to get up early and... (..) do some light exercise.

*CLI: Dr. Johnson
That's great! What kind of exercise do you enjoy doing? (pauses)

*PAT: Emily Wilson
I like walking on the treadmill. It helps me clear my head.

*CLI: Dr. Johnson
That sounds wonderful. Do you find that it helps with your memory?

*PAT: Memory... um, yeah. I guess so. (laughs nervously)

*CLI: Dr. Johnson
Okay, let's talk about food a bit. What's something you particularly enjoy eating for breakfast?

*PAT: Breakfast... (pauses) toast with scrambled eggs.

*CLI: Dr. Johnson
Scrambled eggs, that's a good choice. Do you find that cooking is still something you can do easily?

*PAT: Cook? (laughs) Oh, yeah. I mean, I try.

*CLI: Dr. Johnson
Okay, let's move on to family. Who lives with you at home?

*PAT: My... uh, my husband. John. He's always taking care of me.

*CLI: That's lovely. Does he notice any changes in your behavior or cognitive function?

*PAT: Behavior? (pauses) No, I don't think so. At least, not that I know of.

*CLI: Okay, let's talk about work. What kind of job do you do?

*PAT: Work... um, I was an accountant. Now I'm retired.

*CLI: Ah, great! What's been the most challenging part of your retirement so far?

*PAT: Um... (pauses) well, I don't know if it's challenging exactly... (trails off)

*CLI: Okay, let's try another question. Can you tell me about a time when you had to make a difficult decision?

*PAT: Decision? (laughs) Oh, yeah... um...

*CLI: Dr. Johnson
Let's take a break for just a minute before we move on to the next question.

*PAT: Okay...

@End

Note that the raw output above drifts from the requested format in places (speaker names on their own lines, '(pauses)' instead of '(.)'); output also varies between runs. A well-formed synthetic transcript generated by the model will look something like this (this example was also produced by llama3.2 using the prompt above):

@Begin
@Participants: CLI Dr_Smith Clinician, PAT Margaret Patient
@Date: 15-MAR-2024
@Location: Memory Clinic, City Hospital
@Comment: Synthetic example transcript — not based on any real participant

*CLI: Good morning Margaret. How are you feeling today?
*PAT: Oh (.) good morning. I'm (.) I'm feeling alright I think. A bit tired.
*CLI: Did you sleep well last night?
*PAT: Well I (.) I tried to. I woke up a few times. I couldn't remember (.)
      I couldn't remember if I had taken my tablets.
*CLI: I see. What did you have for breakfast this morning?
*PAT: Breakfast. Yes. I had (.) um (.) I had some toast I think. Or was it
      cereal? My daughter usually (..) my daughter usually helps me in the
      morning. She's very good.
*CLI: That's lovely. How long has your daughter been helping you?
*PAT: Oh (.) a long time now. Since my husband. Since George pass- passed.
      That was (.) that was two years ago I think. Or maybe three.
*CLI: I'm sorry to hear that. Can you tell me what you did yesterday?
*PAT: Yesterday. Um (.) I think I watched the television. And I had a
      a walk I think. In the garden. I like the garden. I used to (.)
      I used to grow vegetables. Beans and things.
*CLI: That sounds nice. Do you still do any gardening?
*PAT: Not so much now. My hands (.) my hands aren't what they were.
      And I forget (.) I forget what I planted. I started to write things
      down but then I lose- I lose the notebook.
*CLI: What day of the week is it today Margaret?
*PAT: Today? Um (..) it's (.) is it Wednesday? I think Wednesday.
      No (..) I'm not sure actually. I thought it was Wednesday but
      my daughter said something about Thursday.
*CLI: It is Thursday, yes. That's alright. Can you tell me your address?
*PAT: My address. Yes. I live at (.) um (.) I live at twelve (..)
      twelve (..) the street name (.) oh I know this. It's (.) it starts
      with a B. Birch- Birch something.
*CLI: Take your time.
*PAT: Birchwood. I think Birchwood Lane. Number twelve. I've lived there
      thirty years. You'd think I'd know it.
*CLI: You're doing really well. How would you describe your memory lately?
*PAT: My memory. Well (.) not very good if I'm honest. I forget words
      a lot. I know what I want to say but the word (.) the word just
      doesn't come. It's very (.) it's very frustrating.
@End
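
Before saving a generated transcript, it is worth confirming that it actually uses the markers the prompt asked for, since (as the first generation above shows) small models drift from the format. A sketch with a hypothetical helper:

```r
# Hypothetical helper: TRUE if a transcript string carries the expected markers
check_chat_format <- function(txt, min_turns = 10) {
  lines <- unlist(strsplit(txt, "\n"))
  all(
    any(grepl("^@Begin", lines)),
    any(grepl("^@End", lines)),
    sum(grepl("^\\*PAT:", lines)) >= min_turns,
    sum(grepl("^\\*CLI:", lines)) >= min_turns
  )
}

# In the workflow you would call: check_chat_format(synthetic_transcript)
mini <- "@Begin\n*CLI: Hello Margaret.\n*PAT: Uh (.) hello doctor.\n@End"
check_chat_format(mini, min_turns = 1)   # TRUE
```

If the check fails, regenerate with a stricter prompt rather than saving a malformed proxy.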

Saving the synthetic transcript

Code
# Save the synthetic transcript to a file
# This is the file you will share with the cloud AI assistant

writeLines(
  synthetic_transcript,
  "tutorials/localllm_showcase/data/synthetic_transcript_example.cha"
)

message("Synthetic transcript saved.")

Generating multiple synthetic transcripts

For analysis code that processes multiple files, it is useful to generate several synthetic transcripts with some variation. We can do this in a loop:

Code
# Generate three synthetic transcripts with slightly different profiles
# to give the cloud AI a realistic multi-file example to work with

patient_profiles <- list(
  list(
    name   = "Margaret",
    age    = 74,
    notes  = "shows word-finding pauses, some repetition, generally coherent"
  ),
  list(
    name   = "Robert",
    age    = 81,
    notes  = "more frequent topic drift, longer filled pauses, shorter turns"
  ),
  list(
    name   = "Dorothy",
    age    = 68,
    notes  = "earlier stage, mostly fluent but occasional word-finding difficulty"
  )
)

dir.create("tutorials/localllm_showcase/data/synthetic_transcripts",
           recursive = TRUE, showWarnings = FALSE)

for (i in seq_along(patient_profiles)) {
  profile <- patient_profiles[[i]]

  prompt_i <- paste0(
    "Generate a synthetic clinical interview transcript for a patient with
    early-stage Alzheimer's disease. Use the CHAT format described below.
    The patient's name is ", profile$name, ", age ", profile$age, ".
    Language profile: ", profile$notes, ".

    Format rules:
    - Speaker turns marked *CLI: and *PAT:
    - Short pauses: (.) longer pauses: (..)
    - Incomplete words end with hyphen
    - Filled pauses (uh, um, er) written as spoken
    - Begin with @Begin and metadata; end with @End
    - Approximately 30-40 turns total
    - Use entirely fictional names, places, and details
    - Do NOT base on any real person

    Clinician is Dr Chen."
  )

  # Note: LLM output is not reproducible across runs; Ollama's sampling
  # is not controlled by R's random number generator, so set.seed() has no effect here
  transcript_i <- ollamar::generate(
    model  = "llama3.2",
    prompt = prompt_i,
    output = "text"
  )

  out_file <- paste0(
    "tutorials/localllm_showcase/data/synthetic_transcripts/synthetic_",
    tolower(profile$name), ".cha"
  )
  writeLines(transcript_i, out_file)
  message("Saved: ", out_file)
}

Step 3: Getting analysis code from a cloud AI

With synthetic transcripts saved, the researcher can now open Claude (claude.ai), ChatGPT, or any other cloud AI assistant and share:

  1. One or more synthetic transcript files as attachments
  2. A request for R code to perform the analysis

A well-formed request might look like this:

Example prompt to Claude or ChatGPT

I have attached a synthetic example of the transcript format I am working with. These are clinical interview transcripts in CHAT notation. Please write R code that:

  1. Reads all .cha files from a folder called data/transcripts/
  2. Extracts all patient turns (lines starting with *PAT:)
  3. Calculates the mean number of words per patient turn for each file
  4. Counts the number of filled pauses (uh, um, er) per file
  5. Counts the number of incomplete words (words ending in -) per file
  6. Returns a tidy data frame with one row per file and four summary columns: the number of patient turns, mean words per turn, filled-pause count, and incomplete-word count

Please include comments explaining each step. The real transcripts have the same format as the attached example.

Step 4: The code returned by the cloud AI

The cloud AI will return code similar to the following. This code was generated by Claude based on the synthetic transcript above:

Code
# R code generated by Claude from the synthetic transcript proxy
# Run this on your real data locally — the real data never left your machine

library(stringr)
library(dplyr)
library(purrr)
library(readr)

# ---- Helper functions ----

# Extract all patient turns from a single CHAT transcript
extract_patient_turns <- function(file_path) {
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE)
  # Patient turns start with *PAT:
  pat_lines <- lines[str_detect(lines, "^\\*PAT:")]
  # Remove the *PAT: prefix and clean whitespace
  str_remove(pat_lines, "^\\*PAT:\\s*") |>
    str_squish()
}

# Count words in a vector of utterances
count_words <- function(utterances) {
  str_count(utterances, "\\S+")
}

# Count filled pauses (uh, um, er as whole words)
count_filled_pauses <- function(utterances) {
  str_count(
    tolower(paste(utterances, collapse = " ")),
    "\\b(uh|um|er)\\b"
  )
}

# Count incomplete words (words ending with a hyphen)
count_incomplete_words <- function(utterances) {
  str_count(
    paste(utterances, collapse = " "),
    "\\b\\w+-(?=\\s|$)"
  )
}

# ---- Main analysis ----

# Get all .cha files in the transcripts folder
transcript_files <- list.files(
  path       = "data/transcripts",
  pattern    = "\\.cha$",
  full.names = TRUE
)

# Process each file and return a summary row
results <- map_dfr(transcript_files, function(f) {
  turns <- extract_patient_turns(f)

  if (length(turns) == 0) {
    return(tibble(
      file              = basename(f),
      n_turns           = 0L,
      mean_words_per_turn = NA_real_,
      n_filled_pauses   = NA_integer_,
      n_incomplete_words = NA_integer_
    ))
  }

  tibble(
    file                = basename(f),
    n_turns             = length(turns),
    mean_words_per_turn = round(mean(count_words(turns)), 2),
    n_filled_pauses     = count_filled_pauses(turns),
    n_incomplete_words  = count_incomplete_words(turns)
  )
})

print(results)
# A tibble: 0 × 0

Step 5: Run on real data

The researcher now runs this code locally against her real transcript files — simply changing "data/transcripts" to the path where her real .cha files are stored. The real data never left her machine at any point in the workflow.

Code
# Step 5: Run the generated code on real data
# Only change needed: point to the real data folder

real_transcript_files <- list.files(
  path       = "data/real_transcripts",  # <-- your real data folder
  pattern    = "\\.cha$",
  full.names = TRUE
)

# Re-run the analysis using the same helper functions defined above
real_results <- map_dfr(real_transcript_files, function(f) {
  turns <- extract_patient_turns(f)
  tibble(
    file                = basename(f),
    n_turns             = length(turns),
    mean_words_per_turn = round(mean(count_words(turns)), 2),
    n_filled_pauses     = count_filled_pauses(turns),
    n_incomplete_words  = count_incomplete_words(turns)
  )
})

# Save results
write_csv(real_results, "output/transcript_summary.csv")
print(real_results)
# A tibble: 0 × 0
Verifying the code works before using real data

Always test the AI-generated code on the synthetic data first:

Code
# Test on synthetic data — confirm no errors and sensible output
test_files <- list.files(
  path       = "tutorials/localllm_showcase/data/synthetic_transcripts",
  pattern    = "\\.cha$",
  full.names = TRUE
)

test_results <- map_dfr(test_files, function(f) {
  turns <- extract_patient_turns(f)
  tibble(
    file                = basename(f),
    n_turns             = length(turns),
    mean_words_per_turn = round(mean(count_words(turns)), 2),
    n_filled_pauses     = count_filled_pauses(turns),
    n_incomplete_words  = count_incomplete_words(turns)
  )
})

print(test_results)
# A tibble: 3 × 5
  file            n_turns mean_words_per_turn n_filled_pauses n_incomplete_words
  <chr>             <int>               <dbl>           <int>              <int>
1 synthetic_doro…       1                   1               0                  0
2 synthetic_marg…       0                 NaN               0                  0
3 synthetic_robe…       0                 NaN               0                  0
Code
# Note: the test output above shows near-zero patient turns, which means these
# particular synthetic transcripts drifted from the *PAT: format and should be
# regenerated with a stricter prompt. Once the test runs without errors AND
# produces plausible numbers, the code is ready to run on the real data.

Part 2: Synthetic Tabular Data

Section Overview

What you will learn: How to describe the structure of a sensitive clinical dataset to a local LLM; how to prompt the model to generate a synthetic tabular dataset as a CSV; how to parse and validate the output; and how to use the synthetic table to obtain analysis code from a cloud AI assistant

The research scenario

The same research team has also collected a structured dataset accompanying the interviews. For each of the 40 participants, a research assistant has recorded demographic information and scores from three standardised cognitive assessments administered at baseline and at a 12-month follow-up. The dataset is stored as a CSV file with the following variables:

Variable           Type       Description
patient_id         character  Anonymised ID (e.g. AD_001)
age                integer    Age in years at baseline
sex                character  "F" or "M"
education_years    integer    Years of formal education
diagnosis          character  "MCI" or "AD"
mmse_baseline      integer    MMSE score at baseline (0–30)
mmse_followup      integer    MMSE score at 12-month follow-up
fluency_baseline   integer    Verbal fluency (words in 60 sec)
fluency_followup   integer    Verbal fluency at 12-month follow-up
depression_score   integer    GDS-15 depression screening score (0–15)
dropout            character  "yes" or "no" — whether participant withdrew before follow-up
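
When reading a file with this structure, an explicit column specification that mirrors the table above lets readr flag type drift immediately instead of silently guessing. This is a sketch; the file path is illustrative:

```r
library(readr)

# Column specification mirroring the variable table above
clinical_spec <- cols(
  patient_id       = col_character(),
  age              = col_integer(),
  sex              = col_character(),
  education_years  = col_integer(),
  diagnosis        = col_character(),
  mmse_baseline    = col_integer(),
  mmse_followup    = col_integer(),
  fluency_baseline = col_integer(),
  fluency_followup = col_integer(),
  depression_score = col_integer(),
  dropout          = col_character()
)

# clinical <- read_csv("data/clinical_data.csv", col_types = clinical_spec)
```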

The researcher wants the cloud AI to write R code that:

  1. Reads the CSV
  2. Computes change scores for MMSE and fluency (follow-up minus baseline)
  3. Compares change scores between MCI and AD groups
  4. Visualises the change scores with a grouped box plot

Step 1: Describe the data structure

Code
# Step 1: Describe the table structure to the local LLM
# No real participant data included

tabular_description <- "
I need you to generate a synthetic dataset as a CSV that I can use as a proxy
when asking a cloud AI to write R analysis code. The real dataset is sensitive
clinical data that I cannot share externally. The synthetic version must:

1. Have exactly this structure (column names and types must match exactly):
   - patient_id: character, format 'AD_001' to 'AD_040'
   - age: integer, range 65-88, approximately normally distributed, mean ~74
   - sex: character, 'F' or 'M', approximately 60% female
   - education_years: integer, range 8-20, mean ~13
   - diagnosis: character, 'MCI' or 'AD', approximately 55% MCI
   - mmse_baseline: integer, range 18-30 for MCI (mean ~26), 12-24 for AD (mean ~20)
   - mmse_followup: integer, generally 1-4 points lower than baseline;
     about 15% of participants show no change or slight improvement;
     participants who dropped out have NA
   - fluency_baseline: integer, range 8-22 for MCI (mean ~15), 5-16 for AD (mean ~11)
   - fluency_followup: integer, generally 1-3 lower than baseline;
     participants who dropped out have NA
   - depression_score: integer, range 0-15, mean ~4, right-skewed
   - dropout: character, 'yes' or 'no'; approximately 20% dropout;
     dropout participants have NA for all followup variables

2. Have 40 rows (one per participant)
3. Show realistic correlations (e.g. older age and lower education tend to
   co-occur with AD rather than MCI; higher depression associated with lower MMSE)
4. Contain NO real participant data — all values must be entirely fabricated
5. Be returned as a valid CSV with a header row and no row numbers

Return ONLY the CSV content. No explanation, no markdown code blocks,
no preamble. Just the raw CSV starting with the header line.
"

Step 2: Generate the synthetic table

Code
# Step 2: Generate the synthetic CSV using the local LLM

cat("Generating synthetic dataset...\n")
Generating synthetic dataset...
Code
synthetic_csv_raw <- ollamar::generate(
  model  = "llama3.2",
  prompt = tabular_description,
  output = "text"
)

# The model may wrap output in markdown code blocks — strip them if present
synthetic_csv_clean <- synthetic_csv_raw |>
  stringr::str_remove("^```[a-z]*\\n?") |>
  stringr::str_remove("```\\s*$") |>
  stringr::str_trim(side = "both")

cat(substr(synthetic_csv_clean, 1, 500))     # preview first 500 characters
patient_id,age,sex,education_years,diagnosis,mmse_baseline,mmse_followup,fluency_baseline,fluency_followup,depression_score,dropout
AD_001,77,F,15,MCI,28,23,19,17,10,no
AD_002,M,M,18,AD,22,20,12,11,6,no
AD_003,F,M,12,AD,24,21,16,14,13,yes
AD_004,F,M,18,MCI,25,22,15,13,8,yes
AD_005,M,F,10,MCI,27,24,19,17,9,yes
AD_006,F,M,20,MCI,28,25,18,16,11,yes
AD_007,M,F,14,AD,23,21,14,12,7,no
AD_008,M,F,16,MCI,26,24,20,18,10,yes
AD_009,F,M,19,AD,24,22,15,13,5,no
AD_010,M,F,17,AD,25,23,17,15,12,yes
AD_011,M,F,

Parsing and validating the synthetic table

Code
# Parse the CSV string into a data frame
synthetic_df <- readr::read_csv(
  I(synthetic_csv_clean),   # I() tells read_csv to treat the string as file content
  show_col_types = FALSE
)

# Validate structure
cat("Dimensions:", nrow(synthetic_df), "rows x", ncol(synthetic_df), "columns\n")
Dimensions: 20 rows x 11 columns
Code
cat("Column names:", paste(names(synthetic_df), collapse = ", "), "\n\n")
Column names: patient_id, age, sex, education_years, diagnosis, mmse_baseline, mmse_followup, fluency_baseline, fluency_followup, depression_score, dropout 
Code
# Check for expected columns
expected_cols <- c(
  "patient_id", "age", "sex", "education_years", "diagnosis",
  "mmse_baseline", "mmse_followup", "fluency_baseline",
  "fluency_followup", "depression_score", "dropout"
)

missing_cols <- setdiff(expected_cols, names(synthetic_df))
if (length(missing_cols) > 0) {
  warning("Missing columns: ", paste(missing_cols, collapse = ", "))
} else {
  cat("All expected columns present.\n")
}
All expected columns present.
Code
# Quick summary
summary(synthetic_df)
  patient_id            age                sex            education_years
 Length:20          Length:20          Length:20          Min.   :10.0   
 Class :character   Class :character   Class :character   1st Qu.:12.0   
 Mode  :character   Mode  :character   Mode  :character   Median :15.5   
                                                          Mean   :15.1   
                                                          3rd Qu.:18.0   
                                                          Max.   :20.0   
  diagnosis         mmse_baseline  mmse_followup  fluency_baseline
 Length:20          Min.   :22.0   Min.   :20.0   Min.   :12.0    
 Class :character   1st Qu.:24.0   1st Qu.:22.0   1st Qu.:15.0    
 Mode  :character   Median :25.5   Median :23.5   Median :17.5    
                    Mean   :25.6   Mean   :23.7   Mean   :17.4    
                    3rd Qu.:27.2   3rd Qu.:25.0   3rd Qu.:19.2    
                    Max.   :30.0   Max.   :29.0   Max.   :22.0    
 fluency_followup depression_score   dropout         
 Min.   :11.0     Min.   : 5.00    Length:20         
 1st Qu.:13.8     1st Qu.: 7.75    Class :character  
 Median :15.5     Median : 9.50    Mode  :character  
 Mean   :15.8     Mean   : 9.70                      
 3rd Qu.:18.0     3rd Qu.:12.00                      
 Max.   :21.0     Max.   :15.00                      

If the model produces malformed CSV or missing columns, a more detailed prompt usually resolves it. See the prompt refinement tip below.

If the model returns malformed CSV

Small models occasionally produce slightly malformed output — extra text, misaligned columns, or incorrect NA representation. Two strategies help:

1. Be more explicit about output format:

Code
# More explicit prompt additions to improve CSV quality
format_enforcement <- "
IMPORTANT OUTPUT RULES:
- Return ONLY raw CSV text
- First line must be the header row
- Use comma as delimiter
- Use NA (not 'N/A', 'missing', or empty) for missing values
- Do not include row numbers or an index column
- Do not wrap in markdown or code blocks
- Do not add any text before or after the CSV
"

tabular_description_strict <- paste(tabular_description, format_enforcement)

2. Use a chat with an explicit system prompt:

Code
# Alternatively, use chat() with a system prompt enforcing output format
messages <- ollamar::create_message(
  role    = "system",
  content = "You are a data generation assistant. You return ONLY raw CSV text
             with no explanation, no markdown, and no extra text of any kind.
             The very first character of your response must be the first
             character of the CSV header row."
)
messages <- ollamar::append_message(
  content = tabular_description,
  role    = "user",
  x       = messages
)

synthetic_csv_raw2 <- ollamar::chat(
  model    = "llama3.2",
  messages = messages,
  output   = "text"
)
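
Even with a strict system prompt, a small local clean-up step adds robustness before parsing. The sketch below is our own addition (the helper name `clean_csv_text` and the mock response are invented for illustration): it strips stray markdown fences and surrounding prose, keeping only comma-containing lines, before handing the text to `readr::read_csv()` (which requires `I()` for literal data in readr 2.x).

```r
library(readr)

# Hypothetical helper (name ours): keep only the CSV-looking part of a response
clean_csv_text <- function(raw) {
  fence <- strrep("`", 3)                            # markdown code fence marker
  lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]
  lines <- lines[!startsWith(trimws(lines), fence)]  # drop stray fences
  lines <- lines[grepl(",", lines, fixed = TRUE)]    # keep header + data rows
  paste(lines, collapse = "\n")
}

# Demonstration on an invented malformed response
mock_response <- paste0(
  "Here is your data:\n", strrep("`", 3), "csv\n",
  "patient_id,age\nP001,71\nP002,68\n",
  strrep("`", 3), "\nHope this helps!"
)

clean_df <- read_csv(I(clean_csv_text(mock_response)), show_col_types = FALSE)
nrow(clean_df)  # 2 data rows survive; the prose and fences are gone
```

This filter is deliberately crude: it would drop a text column whose values never contain commas, so adapt it to your own data before relying on it.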

Saving the synthetic table

Code
dir.create(here::here("tutorials/localllm_showcase/data"), recursive = TRUE, showWarnings = FALSE)

readr::write_csv(
  synthetic_df,
  here::here("tutorials/localllm_showcase/data/synthetic_clinical_data.csv")
)

message("Synthetic dataset saved to tutorials/localllm_showcase/data/synthetic_clinical_data.csv")
message("This file is safe to share with a cloud AI assistant.")

Step 3: Getting analysis code from a cloud AI

The researcher attaches synthetic_clinical_data.csv to a conversation in Claude or ChatGPT and submits a request like the following:

Example prompt to Claude or ChatGPT

I have attached a synthetic dataset that has exactly the same structure as the clinical data I need to analyse. Please write R code that:

  1. Reads a CSV file with the same column structure as the attached file from "data/clinical_data.csv"
  2. Computes change scores: mmse_change = mmse_followup - mmse_baseline and fluency_change = fluency_followup - fluency_baseline
  3. Excludes participants with dropout == "yes" from the change score analysis
  4. Runs a Wilcoxon rank-sum test comparing mmse_change between "MCI" and "AD" groups
  5. Runs the same test for fluency_change
  6. Creates a grouped box plot (using ggplot2) showing both change scores side by side, grouped by diagnosis
  7. Prints a summary table of means and standard deviations for each change score by group

Please include comments explaining each step.

Step 4: The code returned by the cloud AI

Code
# R code generated by Claude from the synthetic dataset proxy
# Run locally on your real data — the real data never left your machine

library(dplyr)
library(ggplot2)
library(tidyr)
library(readr)

# ---- Load data ----
dat <- read_csv(here::here("tutorials/localllm_showcase/data/clinical_data.csv"), show_col_types = FALSE)

# ---- Compute change scores ----
dat <- dat |>
  mutate(
    mmse_change    = mmse_followup - mmse_baseline,
    fluency_change = fluency_followup - fluency_baseline
  )

# ---- Exclude dropouts for change score analysis ----
dat_completers <- dat |>
  filter(dropout == "no")

cat("Completers:", nrow(dat_completers), "of", nrow(dat), "participants\n")
Completers: 35 of 40 participants
Code
# ---- Wilcoxon rank-sum tests ----
# MMSE change: MCI vs AD
mmse_test <- wilcox.test(
  mmse_change ~ diagnosis,
  data = dat_completers,
  exact = FALSE
)
cat("\nMMSE change — Wilcoxon test:\n")

MMSE change — Wilcoxon test:
Code
print(mmse_test)

    Wilcoxon rank sum test with continuity correction

data:  mmse_change by diagnosis
W = 6, p-value = 0.0000003
alternative hypothesis: true location shift is not equal to 0
Code
# Fluency change: MCI vs AD
fluency_test <- wilcox.test(
  fluency_change ~ diagnosis,
  data = dat_completers,
  exact = FALSE
)
cat("\nFluency change — Wilcoxon test:\n")

Fluency change — Wilcoxon test:
Code
print(fluency_test)

    Wilcoxon rank sum test with continuity correction

data:  fluency_change by diagnosis
W = 55, p-value = 0.0002
alternative hypothesis: true location shift is not equal to 0
Code
# ---- Summary table ----
summary_tbl <- dat_completers |>
  group_by(diagnosis) |>
  summarise(
    n                  = n(),
    mmse_change_mean   = round(mean(mmse_change, na.rm = TRUE), 2),
    mmse_change_sd     = round(sd(mmse_change, na.rm = TRUE), 2),
    fluency_change_mean = round(mean(fluency_change, na.rm = TRUE), 2),
    fluency_change_sd   = round(sd(fluency_change, na.rm = TRUE), 2),
    .groups = "drop"
  )

print(summary_tbl)
# A tibble: 2 × 6
  diagnosis     n mmse_change_mean mmse_change_sd fluency_change_mean
  <chr>     <int>            <dbl>          <dbl>               <dbl>
1 AD           15            -2.87           0.35               -1.93
2 MCI          20            -1.3            0.47               -1.3 
# ℹ 1 more variable: fluency_change_sd <dbl>
Code
# ---- Grouped box plot ----
# Reshape to long format for side-by-side plotting
dat_long <- dat_completers |>
  select(patient_id, diagnosis, mmse_change, fluency_change) |>
  pivot_longer(
    cols      = c(mmse_change, fluency_change),
    names_to  = "measure",
    values_to = "change"
  ) |>
  mutate(
    measure = recode(measure,
      mmse_change    = "MMSE change",
      fluency_change = "Fluency change"
    )
  )

p <- ggplot(dat_long, aes(x = diagnosis, y = change, fill = diagnosis)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
  geom_jitter(width = 0.15, alpha = 0.4, size = 1.5) +
  facet_wrap(~ measure, scales = "free_y") +
  scale_fill_manual(values = c("MCI" = "#4E79A7", "AD" = "#F28E2B")) +
  labs(
    title    = "Cognitive change over 12 months by diagnosis group",
    subtitle = "Negative values indicate decline; completers only",
    x        = "Diagnosis",
    y        = "Change score (follow-up minus baseline)",
    fill     = "Diagnosis",
    caption  = "Wilcoxon rank-sum test; box = IQR; dots = individual participants"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

print(p)

Code
# Save figure
ggsave(here::here("tutorials/localllm_showcase/images/change_score_boxplot.png"), p,
       width = 8, height = 5, dpi = 300)

Step 5: Run on real data

Code
# Step 5: Run the generated code on real data
# Only change needed: the file path

# Replace the path in the read_csv() call above:
# FROM: dat <- read_csv(here::here("tutorials/localllm_showcase/data/clinical_data.csv"), ...)
# TO:   dat <- read_csv(here::here("data/YOUR_confidential_patient_data.csv"), ...)

# All other code runs identically — the real data has the same column structure
# as the synthetic proxy, so every downstream step works without modification.

# Confirm the real data has the expected structure before running:
real_dat <- read_csv(here::here("tutorials/localllm_showcase/data/clinical_data.csv"),
                     show_col_types = FALSE)
cat("Real data dimensions:", nrow(real_dat), "x", ncol(real_dat), "\n")
Real data dimensions: 40 x 11 
Code
cat("Columns:", paste(names(real_dat), collapse = ", "), "\n")
Columns: patient_id, age, sex, education_years, diagnosis, mmse_baseline, mmse_followup, fluency_baseline, fluency_followup, depression_score, dropout 
Code
stopifnot(all(expected_cols %in% names(real_dat)))
cat("Structure verified. Safe to run the full analysis.\n")
Structure verified. Safe to run the full analysis.
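
Column names alone do not guarantee the generated code will run: a column that arrives as character where numeric is expected will break every arithmetic step. A minimal type check can catch this before the full analysis runs. The helper below and its toy values are our own sketch, not part of the generated code:

```r
# Hypothetical helper (name ours): compare column classes to an expected lookup
check_types <- function(df, expected) {
  actual <- vapply(df[names(expected)], function(x) class(x)[1], character(1))
  names(expected)[actual != expected]   # names of mismatching columns
}

# Toy demonstration with invented values: age was read as character
toy <- data.frame(age = c("71", "68"), mmse_baseline = c(24, 27))
check_types(toy, c(age = "numeric", mmse_baseline = "numeric"))  # flags "age"
```

A mismatch flagged here is usually fixed with a one-line conversion (e.g. `as.numeric()`) in a local pre-processing step, without involving the cloud AI at all.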

Iterating on the Workflow

Section Overview

What you will learn: How to handle the case where the AI-generated code does not quite match your real data; how to iterate without ever exposing real data; and how to manage a multi-step analysis with the local LLM

In practice, the code returned by the cloud AI will sometimes need minor adjustments. Perhaps the column names in the real data differ slightly from what you described, or the code makes assumptions about data types that do not hold. The key principle is to iterate using synthetic data only — never share the error message if it contains real data values.

Handling code that does not quite work

If the code fails on your real data, follow this sequence:

Code
# If the code fails on real data, debug using synthetic data only

# 1. Reproduce the error on synthetic data
#    (If it does not reproduce, the difference is in the real data structure)

# 2. If the error involves a specific column name or value that differs
#    in your real data, describe the difference in abstract terms to the
#    cloud AI — never paste in real values

# Example: "The real data uses 'Male' and 'Female' instead of 'M' and 'F'
#           for the sex variable. Please update the code accordingly."

# 3. Alternatively, fix the discrepancy in a pre-processing step that
#    runs locally on the real data before the AI-generated code

# Pre-processing adapter — runs locally, never seen by cloud AI
# (uses dplyr and stringr)
library(dplyr)
library(stringr)
preprocess_real_data <- function(dat) {
  dat |>
    mutate(
      # Harmonise sex coding to match what the AI code expects
      sex = case_when(
        tolower(sex) %in% c("female", "f", "woman") ~ "F",
        tolower(sex) %in% c("male", "m", "man")     ~ "M",
        TRUE ~ sex
      ),
      # Harmonise diagnosis coding
      diagnosis = case_when(
        str_detect(diagnosis, regex("mild cognitive", ignore_case = TRUE)) ~ "MCI",
        str_detect(diagnosis, regex("alzheimer", ignore_case = TRUE))      ~ "AD",
        TRUE ~ diagnosis
      )
    )
}

# Apply before running the AI-generated analysis code
real_dat_clean <- preprocess_real_data(real_dat)

Asking for multiple scripts in one session

Once the synthetic proxy is established, you can use it for multiple analysis requests in the same cloud AI conversation — no need to re-upload:

Efficient multi-request workflow

First request: “I’ve attached a synthetic dataset. Please write code to compute change scores and run a Wilcoxon test as described above.”

(Receive code, test on synthetic data, run on real data)

Second request (same conversation): “Using the same dataset structure, please now write code to run a logistic regression predicting dropout from baseline MMSE, age, education years, and diagnosis. Include model diagnostics.”

(Cloud AI already knows the data structure — no re-upload needed)

Each request in the same conversation leverages the cloud AI’s memory of the synthetic data structure. You only upload the synthetic file once per session.


A Note on Data Synthesis Quality

Section Overview

What you will learn: How to assess whether a synthetic dataset is a good proxy; when LLM-generated synthesis is and is not appropriate; and how to supplement LLM generation with statistical synthesis tools for tabular data

What makes a good proxy?

A synthetic proxy dataset is fit for purpose when:

  1. The structure matches exactly — same column names, same data types, same file format
  2. The value ranges are realistic — the real data should not contain edge cases that never appear in the synthetic proxy, or the generated code may silently mishandle them
  3. The statistical properties are plausible — if the real data has correlated variables (e.g. older patients have lower MMSE), the synthetic data should too, or the generated analysis code may not handle the patterns correctly
  4. Missing data patterns are represented — if the real data has dropouts or missing values, the synthetic data must include them so the code handles them correctly
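
Criterion 3 can be spot-checked with a correlation matrix on the numeric columns of the synthetic data. The toy values below are invented for illustration; the point is simply that expected relationships (here, older age pairing with lower MMSE) should be visible in the proxy:

```r
# Invented synthetic values: older ages paired with lower MMSE scores
toy_synth <- data.frame(
  age  = c(62, 70, 75, 81, 68),
  mmse = c(28, 26, 24, 21, 27)
)

# A strong negative age-MMSE correlation indicates the proxy
# reproduces the expected clinical pattern
round(cor(toy_synth), 2)
```

If the synthetic correlations are implausible (e.g. near zero where the real data is strongly correlated), regenerate with a prompt that states the expected relationship explicitly.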

When LLM synthesis is sufficient

For the purposes of code generation, LLM synthesis is usually sufficient. The cloud AI needs to understand the data structure well enough to write correct R syntax — it does not need a statistically precise replica of the real data.

The main risk is that the generated code makes implicit assumptions based on the synthetic data that do not hold in the real data. For example, if the synthetic data has no missing values in a column that the real data does have missing values in, the generated code may not include na.rm = TRUE in the right places. This is why the validation step before running on real data is important.
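
A quick audit of per-column missingness makes this risk concrete. The helper and toy frames below are our own sketch (invented values): if the synthetic proxy shows zero missingness in a column where the real data has NAs, the generated code was never forced to handle them.

```r
# Hypothetical helper (name ours): proportion of missing values per column
na_profile <- function(df) vapply(df, function(x) mean(is.na(x)), numeric(1))

# Invented toy frames: the synthetic proxy has no NAs, the "real" data does
synthetic_toy <- data.frame(mmse = c(24, 27, 25), fluency = c(17, 15, 19))
real_toy      <- data.frame(mmse = c(24, NA, 25), fluency = c(17, 15, NA))

na_profile(synthetic_toy)  # all zero: the proxy hides the missingness
na_profile(real_toy)       # non-zero: these columns need na.rm handling
```

Run this locally on both datasets; if the profiles disagree, either regenerate the synthetic data with explicit NA instructions or ask the cloud AI (in abstract terms) to make the code NA-safe.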

When to use statistical synthesis instead

For tabular data where statistical fidelity matters — for example, when checking that a proposed analysis has adequate power, or when sharing a dataset for external replication — purpose-built synthesis packages produce more statistically faithful output than LLMs:

Code
# For statistically faithful tabular synthesis, consider synthpop
# install.packages("synthpop")
library(synthpop)

# Generate a statistically faithful synthetic version of the real dataset
# This runs locally — the real data still never leaves your machine
synth_result <- synthpop::syn(
  real_dat,           # your real data frame
  seed = 42           # for reproducibility
)
CAUTION: Your data set has fewer observations (40) than we advise.
We suggest that there should be at least 210 observations
(100 + 10 * no. of variables used in modelling the data).
Please check your synthetic data carefully with functions
compare(), utility.tab(), and utility.gen().


Variable(s): patient_id, sex, diagnosis, dropout have been changed for synthesis from character to factor.

Synthesis
-----------
 patient_id age sex education_years diagnosis mmse_baseline mmse_followup fluency_baseline fluency_followup depression_score dropout
Code
# Extract the synthesised data frame
statistical_synthetic <- synth_result$syn

# This is more statistically faithful than LLM generation,
# but requires access to the real data to run.
# Use for: power analysis, methods validation, external sharing
# Use LLM generation for: getting code quickly without loading real data at all

Combining both approaches

The two approaches are complementary:

  • Use LLM generation when you want to describe the data structure without loading the real data (e.g. at the start of a project, or on a different machine)
  • Use statistical synthesis when you need a statistically faithful copy for quantitative validation or sharing with collaborators

In both cases, the real data stays local.


Summary

This showcase has demonstrated a complete privacy-preserving workflow for using cloud AI assistants to write analysis code for sensitive research data:

The core idea is that cloud AI assistants need to understand your data structure, not your data content, in order to write useful code. A synthetic proxy that mirrors the structure is sufficient for this purpose.

For transcript data, a local LLM can generate realistic synthetic transcripts from a textual description of the format and linguistic features — no real transcript content is needed in the prompt.

For tabular data, a local LLM can generate a synthetic CSV from a description of variable names, types, and value ranges — no real data values are needed in the prompt.

The five-step workflow — describe locally, generate locally, upload synthetic, receive code, run locally — ensures that sensitive participant data remains on the researcher’s own machine at every stage.

The local LLM is the key enabler of this workflow: it allows the data generation step to happen without any data leaving the machine, even for the synthetic data generation itself. The description of your sensitive data structure is itself information that should be kept local.


Citation & Session Info

Schweinberger, Martin. 2026. Privacy-Preserving Analysis with Local LLMs: Generating Synthetic Data Proxies. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/localllm_showcase/localllm_showcase.html (Version 2026.05.01).

@manual{schweinberger2026localllm_showcase,
  author       = {Schweinberger, Martin},
  title        = {Privacy-Preserving Analysis with Local LLMs: Generating Synthetic Data Proxies},
  note         = {tutorials/localllm_showcase/localllm_showcase.html},
  year         = {2026},
  organization = {The University of Queensland, Australia. School of Languages and Cultures},
  address      = {Brisbane},
  edition      = {2026.05.01}
}

AI Transparency Statement

This showcase was written with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to draft and structure the showcase, including all R code, workflow descriptions, and example outputs. All content was reviewed and approved by Martin Schweinberger, who takes full responsibility for its accuracy.

Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] synthpop_1.9-2   tidyr_1.3.2      ggplot2_4.0.2    jsonlite_2.0.0  
 [5] flextable_0.9.11 purrr_1.2.1      readr_2.1.5      stringr_1.6.0   
 [9] tibble_3.3.1     dplyr_1.2.0      ollamar_1.2.2    checkdown_0.0.13

loaded via a namespace (and not attached):
  [1] cmm_1.0                 httr2_1.2.2             sandwich_3.1-1         
  [4] rlang_1.1.7             magrittr_2.0.4          multcomp_1.4-28        
  [7] matrixStats_1.5.0       e1071_1.7-16            polspline_1.1.25       
 [10] compiler_4.4.2          systemfonts_1.3.1       vctrs_0.7.2            
 [13] rmutil_1.1.10           pkgconfig_2.0.3         crayon_1.5.3           
 [16] fastmap_1.2.0           labeling_0.4.3          utf8_1.2.6             
 [19] rmarkdown_2.30          markdown_2.0            tzdb_0.5.0             
 [22] ragg_1.5.1              bit_4.6.0               xfun_0.56              
 [25] modeltools_0.2-23       randomForest_4.7-1.2    uuid_1.2-1             
 [28] parallel_4.4.2          R6_2.6.1                Rsolnp_2.0.1           
 [31] coin_1.4-3              stringi_1.8.7           RColorBrewer_1.1-3     
 [34] ranger_0.17.0           parallelly_1.42.0       rpart_4.1.23           
 [37] numDeriv_2016.8-1.1     Rcpp_1.1.1              knitr_1.51             
 [40] future.apply_1.11.3     zoo_1.8-13              Matrix_1.7-2           
 [43] splines_4.4.2           nnet_7.3-19             tidyselect_1.2.1       
 [46] rstudioapi_0.17.1       broman_0.92             yaml_2.3.10            
 [49] codetools_0.2-20        curl_7.0.0              listenv_0.9.1          
 [52] lattice_0.22-6          plyr_1.8.9              withr_3.0.2            
 [55] S7_0.2.1                askpass_1.2.1           evaluate_1.0.5         
 [58] foreign_0.8-87          future_1.34.0           survival_3.7-0         
 [61] proxy_0.4-27            zip_2.3.2               xml2_1.3.6             
 [64] pillar_1.11.1           BiocManager_1.30.27     party_1.3-18           
 [67] KernSmooth_2.23-24      renv_1.1.7              stats4_4.4.2           
 [70] generics_0.1.4          vroom_1.7.0             rprojroot_2.1.1        
 [73] truncnorm_1.0-9         hms_1.1.4               scales_1.4.0           
 [76] globals_0.16.3          class_7.3-22            glue_1.8.0             
 [79] gdtools_0.5.0           tools_4.4.2             data.table_1.17.0      
 [82] forcats_1.0.0           mvtnorm_1.3-3           grid_4.4.2             
 [85] libcoin_1.0-10          mipfp_3.2.1             patchwork_1.3.0        
 [88] proto_1.0.0             cli_3.6.5               rappdirs_0.3.3         
 [91] textshaping_1.0.0       officer_0.7.3           fontBitstreamVera_0.1.1
 [94] strucchange_1.5-4       gtable_0.3.6            digest_0.6.39          
 [97] fontquiver_0.2.1        classInt_0.4-11         TH.data_1.1-3          
[100] htmlwidgets_1.6.4       farver_2.1.2            htmltools_0.5.9        
[103] lifecycle_1.0.5         here_1.0.2              fontLiberation_0.1.0   
[106] openssl_2.3.2           bit64_4.6.0-1           MASS_7.3-61            


